Bayesian inference provides an intuitive and self-consistent approach to statistical modeling. In short, you have beliefs about unobserved values (e.g., the impact of price on customer satisfaction) and you use data to update those beliefs. The barrier to using Bayesian inference in practice has never been this intuition – it’s been the required computation. There has been a great deal of progress on this front, with Stan arguably providing the best modern solution.

Stan is a probabilistic programming language that provides a general-purpose sampler using Hamiltonian Monte Carlo. In other words, Stan automates the required computation (for many models), allowing you to conduct Bayesian inference by focusing solely on model building. This is especially powerful when it comes to utilizing the mainstay of Bayesian inference: hierarchical models.

The goal of this post is to provide a gentle introduction to building hierarchical models in Stan. We will be interfacing with Stan through R and will be adopting a tidy approach whenever appropriate. This post does not provide an introduction to Bayesian inference, a complete Bayesian workflow, or the basics of using Stan. For that, I recommend Richard McElreath’s Statistical Rethinking, Michael Betancourt’s case studies, and the Stan User’s Guide. Much of what follows is motivated by Richard and Michael’s work as well as Ben Goodrich’s StanCon 2019 tutorial.

Simple regression

By way of introduction, let’s start with a simple, non-hierarchical regression. In a Stan script, which has native support in RStudio, we specify the three required blocks for a Stan model: data, parameters, and model (i.e., the prior and the likelihood).

// Index value and observations.
data {
  int<lower = 1> N;  // Number of observations.
  vector[N] y;       // Vector of observations.
}

// Parameters.
parameters {
  real mu;           // Mean of the regression.
  real<lower=0> tau; // Standard deviation of the regression.
}

// Regression.
model {
  // Priors.
  mu ~ normal(0, 5);
  tau ~ cauchy(0, 2.5);

  // Likelihood.
  y ~ normal(mu, tau);
}

We’ll save this Stan script as regression.stan. Imagine this is a model of customer satisfaction where we have N individuals, a vector of satisfaction scores y with a single observation from each individual, and we are assuming that satisfaction y is normally distributed, with a single mean mu and standard deviation tau describing customer satisfaction in the population. (Note that Stan’s normal() is parameterized by the standard deviation, not the variance.) Finally, our joint model is complete with a normal prior on mu and a Cauchy prior on tau, which the lower bound in tau’s declaration effectively makes a half-Cauchy.

Like all Bayesian models, this model is generative, so we can also use Stan to generate data according to some assumed parameter values and then use the generated data to test the model, including demonstrating parameter recovery. To do this, we reorganize the model into a data block (which now includes the assumed parameter values as data) and a generated quantities block (which now includes the observations that will be generated).

// Index and parameter values.
data {
  int<lower = 1> N;  // Number of observations.
  real mu;           // Mean of the regression.
  real<lower=0> tau; // Standard deviation of the regression.
}

// Generate data according to the simple regression.
generated quantities {
  vector[N] y;       // Vector of observations.

  // Generate data.
  for (n in 1:N) {
    y[n] = normal_rng(mu, tau);
  }
}

We’ll save this Stan script as generate_data_00.stan. The for loop over y in the generated quantities block emphasizes the strong assumption of this non-hierarchical model that a single mean mu and variance tau describe customer satisfaction for the entire population. If this doesn’t sit well with you, a hierarchical model will provide the cure. We’ll get there shortly.

In an R script, let’s load the necessary packages, allow Stan to use as many cores as we have available, allow Stan to save compiled code, specify assumed parameter values, and generate data according to our simple regression by calling generate_data_00.stan.

# Load packages.
library(tidyverse)
library(rstan)
library(bayesplot)
library(tidybayes)

# Set Stan options.
options(mc.cores = parallel::detectCores())
rstan_options(auto_write = TRUE)

# Specify data and parameter values.
sim_values <- list(
  N = 100, # Number of observations.
  mu = 5,  # Mean of the regression.
  tau = 1  # Standard deviation of the regression.
)

# Generate data.
sim_data <- stan(
  file = here::here("Code", "generate_data_00.stan"),
  data = sim_values,
  iter = 1,
  chains = 1,
  seed = 42,
  algorithm = "Fixed_param"
)
SAMPLING FOR MODEL 'generate_data_00' NOW (CHAIN 1).
Chain 1: Iteration: 1 / 1 [100%]  (Sampling)
Chain 1: 
Chain 1:  Elapsed Time: 0 seconds (Warm-up)
Chain 1:                0.000127 seconds (Sampling)
Chain 1:                0.000127 seconds (Total)
Chain 1: 

Stan provides verbose output noting progress as well as any warnings. Everything looks fine here, so we can go ahead and extract our simulated data.

# Extract simulated data.
sim_y <- extract(sim_data)$y
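
As a quick sanity check, the simulated draws should be consistent with the assumed parameter values.

# Sanity check: these should land near mu = 5 and tau = 1, respectively.
mean(sim_y)
sd(sim_y)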

To test our model, we simply specify the simulated data as a list of data to be used as an input, call the regression model regression.stan from R, and Stan does all the heavy lifting for us.

# Specify data.
data <- list(
  N = length(sim_y),   # Number of observations.
  y = as.vector(sim_y) # Vector of observations.
)

# Calibrate the model.
fit <- stan(
  file = here::here("Code", "regression.stan"),
  data = data,
  seed = 42
)
...

SAMPLING FOR MODEL 'regression' NOW (CHAIN 4).
Chain 4: Gradient evaluation took 1.2e-05 seconds
Chain 4: 1000 transitions using 10 leapfrog steps per transition would take 0.12 seconds.
Chain 4: Adjust your expectations accordingly!
Chain 4: 
Chain 4: 
Chain 4: Iteration:    1 / 2000 [  0%]  (Warmup)
Chain 4: Iteration:  200 / 2000 [ 10%]  (Warmup)
Chain 4: Iteration:  400 / 2000 [ 20%]  (Warmup)
Chain 4: Iteration:  600 / 2000 [ 30%]  (Warmup)
Chain 4: Iteration:  800 / 2000 [ 40%]  (Warmup)
Chain 4: Iteration: 1000 / 2000 [ 50%]  (Warmup)
Chain 4: Iteration: 1001 / 2000 [ 50%]  (Sampling)
Chain 4: Iteration: 1200 / 2000 [ 60%]  (Sampling)
Chain 4: Iteration: 1400 / 2000 [ 70%]  (Sampling)
Chain 4: Iteration: 1600 / 2000 [ 80%]  (Sampling)
Chain 4: Iteration: 1800 / 2000 [ 90%]  (Sampling)
Chain 4: Iteration: 2000 / 2000 [100%]  (Sampling)
Chain 4: 
Chain 4:  Elapsed Time: 0.031208 seconds (Warm-up)
Chain 4:                0.024732 seconds (Sampling)
Chain 4:                0.05594 seconds (Total)
Chain 4: 

Stan takes the three required blocks in regression.stan and creates a Hamiltonian Monte Carlo sampler to produce draws from the posterior. The output tracks progress across many iterations (2000 by default) and chains (4 by default), along with any problems. Here we have included the printed output for the fourth chain only, and no problems have been identified. We can also check the trace plots.

# Check trace plots.
fit %>%
  mcmc_trace(
    pars = c("mu", "tau"),
    n_warmup = 500,
    facet_args = list(nrow = 2, labeller = label_parsed)
  )

We have good mixing and clear convergence across all chains for both of our model parameters.
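
We can also check the numerical summaries, including effective sample sizes and Rhat, with a standard print call:

# Check numerical summaries.
print(fit, pars = c("mu", "tau"))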

Finally, we can demonstrate parameter recovery by plotting the marginal posteriors for each of our parameters to confirm that the assumed parameter values used when generating data are within reasonable, typically 95%, credible intervals.

# Recover parameter values.
par_values <- tibble(
  .variable = c("mu", "tau"),
  values = c(sim_values$mu, sim_values$tau)
)

fit %>%
  gather_draws(mu, tau) %>%
  ggplot(aes(x = .value, y = .variable)) +
  geom_halfeyeh(.width = .95) +
  geom_vline(aes(xintercept = values), par_values, color = "red") +
  facet_wrap(
    ~ .variable,
    nrow = 2,
    scales = "free"
  )

The assumed parameter values (in red) have been recovered by the model (they are contained within the customary 95% credible interval)! In summary, we’ve specified a simple regression model in Stan, generated data according to that model, and used the generated data to demonstrate that the model is working as intended.

Now let’s move on to a hierarchical model.

Simple hierarchical regression

So why use a hierarchical model? The most straightforward reason is that sinking feeling we have about the assumption that a single set of parameters will effectively describe the entire population. In terms of customer satisfaction, we know there is heterogeneity across consumers – not everyone behaves or thinks the same. At the very least, we can think about satisfaction being different for different customer groups or segments.

So what does a simple hierarchical regression look like in Stan?

// Index values and observations.
data {
  int<lower = 1> N;               // Number of observations.
  int<lower = 1> K;               // Number of groups.
  vector[N] y;                    // Vector of observations.
  int<lower = 1, upper = K> g[N]; // Vector of group assignments.
}

// Parameters and hyperparameters.
parameters {
  vector[K] beta;                 // Vector of group intercepts.
  real mu;                        // Mean of the population model.
  real<lower=0> tau;              // Standard deviation of the population model.
  real<lower=0> sigma;            // Standard deviation of the likelihood.
}

// Hierarchical regression.
model {
  // Hyperpriors.
  mu ~ normal(0, 5);
  tau ~ cauchy(0, 2.5);

  // Prior.
  sigma ~ cauchy(0, 2.5);

  // Population model and likelihood.
  beta ~ normal(mu, tau);
  for (n in 1:N) {
    y[n] ~ normal(beta[g[n]], sigma);
  }
}

In the data block, we now have a vector g indicating which one of K groups each of our N individuals belongs to. In the parameters block, we now have a K-dimensional vector of beta parameters: a separate mean for each of the K groups. In the model block we can see that the likelihood (now within a for loop) is still assumed normal, but now each individual’s observed satisfaction has a mean beta specific to the group they belong to. We also have a population model, beta ~ normal(mu, tau), which says that these separate group-level beta coefficients are themselves drawn from a population that is assumed normal with mean mu and standard deviation tau.

This is what is meant by a hierarchy: there are two levels to our model, the lower-level likelihood and the upper-level population model. Finally, our joint model is complete with priors on the population model parameters (formally referred to as hyperpriors, since they are priors on a prior): a normal hyperprior on mu and a Cauchy hyperprior on tau. Note that we are not assuming a hierarchy for the likelihood’s standard deviation, now called sigma, so it retains the same prior as before.
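
Written out, the two levels of the joint model are:

$$
y_n \sim \text{Normal}(\beta_{g[n]}, \sigma), \quad n = 1, \dots, N,
$$
$$
\beta_k \sim \text{Normal}(\mu, \tau), \quad k = 1, \dots, K,
$$

with the hyperpriors $\mu \sim \text{Normal}(0, 5)$ and $\tau \sim \text{Cauchy}^+(0, 2.5)$ and the prior $\sigma \sim \text{Cauchy}^+(0, 2.5)$.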

We can again translate this Stan script, hierarchical_regression_01.stan, into data and generated quantities blocks called generate_data_01.stan to use Stan to generate data for us.

// Index and hyperparameter values.
data {
  int<lower = 1> N;               // Number of observations.
  int<lower = 1> K;               // Number of groups.
  int<lower = 1, upper = K> g[N]; // Vector of group assignments.
  real mu;                        // Mean of the population model.
  real<lower=0> tau;              // Standard deviation of the population model.
  real<lower=0> sigma;            // Standard deviation of the likelihood.
}

// Generate data according to the hierarchical regression.
generated quantities {
  vector[N] y;                    // Vector of observations.
  vector[K] beta;                 // Vector of group intercepts.

  // Assign to a group, draw parameter values, generate data.
  for (k in 1:K) {
    beta[k] = normal_rng(mu, tau);
  }
  for (n in 1:N) {
    y[n] = normal_rng(beta[g[n]], sigma);
  }
}

Note that the data block now includes the assumed hyperparameter and parameter values as data, with beta being drawn alongside y in the generated quantities block. Let’s call this Stan script from R to generate data according to a hierarchical model.

# Specify data and hyperparameter values.
sim_values <- list(
  N = 100,                            # Number of observations.
  K = 3,                              # Number of groups.
  g = sample(3, 100, replace = TRUE), # Vector of group assignments.
  mu = 5,                             # Mean of the population model.
  tau = 1,                            # Standard deviation of the population model.
  sigma = 1                           # Standard deviation of the likelihood.
)

# Generate data.
sim_data <- stan(
  file = here::here("Code", "generate_data_01.stan"),
  data = sim_values,
  iter = 1,
  chains = 1,
  seed = 42,
  algorithm = "Fixed_param"
)
SAMPLING FOR MODEL 'generate_data_01' NOW (CHAIN 1).
Chain 1: Iteration: 1 / 1 [100%]  (Sampling)
Chain 1: 
Chain 1:  Elapsed Time: 0 seconds (Warm-up)
Chain 1:                6.6e-05 seconds (Sampling)
Chain 1:                6.6e-05 seconds (Total)
Chain 1: 
# Extract simulated data and group intercepts.
sim_y <- extract(sim_data)$y
sim_beta <- extract(sim_data)$beta

We can test our simple hierarchical model using our new simulated data to fit the hierarchical regression from R.

# Specify data.
data <- list(
  N = length(sim_y),    # Number of observations.
  K = sim_values$K,     # Number of groups.
  y = as.vector(sim_y), # Vector of observations.
  g = sim_values$g      # Vector of group assignments.
)

# Calibrate the model.
fit <- stan(
  file = here::here("Code", "hierarchical_regression_01.stan"),
  data = data,
  seed = 42
)
Warning messages:
1: There were 6 divergent transitions after warmup. Increasing adapt_delta above 0.8 may help. See
http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup 
2: Examine the pairs() plot to diagnose sampling problems

Our sampler has identified a problem. We have only a simple hierarchical model, and already a diagnostic unique to Hamiltonian Monte Carlo, the divergent transition, has flagged an issue. There is a good reason for these divergences: hierarchical models have posteriors with geometry that is difficult to navigate. This is true regardless of the sampler, but it becomes quickly evident when using Hamiltonian Monte Carlo.
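
rstan also provides helpers for inspecting these diagnostics programmatically; for example:

# Count the post-warmup divergent transitions across all chains.
rstan::get_num_divergent(fit)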

To address this issue, our first step (and one that is helpfully suggested in the warning message above) is to increase adapt_delta, the target average acceptance probability. A larger adapt_delta forces the sampler to use a smaller step size, meaning it will navigate the posterior more carefully, though it will also take longer. We find that adapt_delta = 0.99 (it can only go up to 1) results in no divergences.

# Calibrate the model.
fit <- stan(
  file = here::here("Code", "hierarchical_regression_01.stan"),
  data = data,
  control = list(adapt_delta = 0.99),
  seed = 42
)

We can now check trace plots, this time for both the population hyperparameters and the group parameters.

# Check trace plots.
fit %>%
  mcmc_trace(
    pars = c("mu", "tau", str_c("beta[", 1:data$K, "]"), "sigma"),
    n_warmup = 500,
    facet_args = list(nrow = 5, labeller = label_parsed)
  )

Finally, we can demonstrate parameter recovery for both the population-level hyperparameters and the group-level parameters.

# Recover hyperparameter and parameter values.
hyper_par_values <- tibble(
  .variable = c("mu", "tau", "sigma"),
  values = c(sim_values$mu, sim_values$tau, sim_values$sigma)
)

beta_values <- tibble(
  n = 1:data$K,
  beta = as.vector(sim_beta)
)

fit %>%
  gather_draws(mu, tau, sigma) %>%
  ggplot(aes(x = .value, y = .variable)) +
  geom_halfeyeh(.width = .95) +
  geom_vline(aes(xintercept = values), hyper_par_values, color = "red") +
  facet_wrap(
    ~ .variable,
    nrow = 3,
    scales = "free"
  )

fit %>%
  spread_draws(beta[n]) %>%
  ggplot(aes(x = beta, y = n)) +
  geom_halfeyeh(.width = .95) +
  geom_vline(aes(xintercept = beta), beta_values, color = "red") +
  facet_wrap(
    ~ n,
    nrow = 3,
    scales = "free"
  )

Both the assumed hyperparameter and parameter values have been recovered by the model! To summarize, we have now specified a simple hierarchical regression model in Stan, generated data according to that model, and used the generated data to demonstrate that the model is working.

Multiple hierarchical regression

In this next section, let’s move from simple hierarchical regression to multiple hierarchical regression by including observation-level covariates and a matrix of Beta coefficients.

// Index values, observations, and covariates.
data {
  int<lower = 1> N;               // Number of individuals.
  int<lower = 1> K;               // Number of groups.
  int<lower = 1> I;               // Number of observation-level covariates.

  vector[N] y;                    // Vector of observations.
  int<lower = 1, upper = K> g[N]; // Vector of group assignments.
  matrix[N, I] X;                 // Matrix of observation-level covariates.
}

// Parameters and hyperparameters.
parameters {
  matrix[K, I] Beta;              // Matrix of group-level coefficients.
  real mu;                        // Mean of the population model.
  real<lower=0> tau;              // Standard deviation of the population model.
}

// Hierarchical regression.
model {
  // Hyperpriors.
  mu ~ normal(0, 5);
  tau ~ cauchy(0, 2.5);

  // Population model and likelihood.
  for (k in 1:K) {
    Beta[k,] ~ normal(mu, tau);
  }
  for (n in 1:N) {
    y[n] ~ normal(X[n,] * Beta[g[n],]', 1);
  }
}

In the data block, we now have an index I specifying the number of observation-level covariates and an N x I matrix X that contains those covariates (including an intercept). In the parameters block, Beta is now a K x I matrix of coefficients rather than a vector. The model block changes accordingly, with the population model in a for loop.
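
In math, the likelihood now has a group-specific linear predictor (note that this version of the model fixes the likelihood scale at 1, an assumption we relax later):

$$
y_n \sim \text{Normal}\left(x_n^\top \beta_{g[n]}, 1\right), \qquad \beta_{k,i} \sim \text{Normal}(\mu, \tau),
$$

where $x_n$ is the $n$th row of X and $\beta_{g[n]}$ is the row of Beta corresponding to individual $n$’s group.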

We can again translate this Stan script, hierarchical_regression_02.stan, into data and generated quantities blocks called generate_data_02.stan to use Stan to generate data, including X, for us.

// Index values, observations, and covariates.
data {
  int<lower = 1> N;               // Number of individuals.
  int<lower = 1> K;               // Number of groups.
  int<lower = 1> I;               // Number of observation-level covariates.

  int<lower = 1, upper = K> g[N]; // Vector of group assignments.
  real mu;                        // Mean of the population model.
  real<lower=0> tau;              // Standard deviation of the population model.
}

// Generate data according to the hierarchical regression.
generated quantities {
  vector[N] y;                    // Vector of observations.
  matrix[N, I] X;                 // Matrix of observation-level covariates.
  matrix[K, I] Beta;              // Matrix of group-level coefficients.

  // Draw parameter values.
  for (k in 1:K) {
    for (i in 1:I) {
      Beta[k, i] = normal_rng(mu, tau);
    }
  }
  
  // Generate data, both X (including the intercept) and y.
  for (n in 1:N) {
    for (i in 1:I) {
      if (i == 1) X[n, i] = 1;
      if (i != 1) X[n, i] = uniform_rng(1, 7);
    }
    y[n] = normal_rng(Beta[g[n],] * X[n,]', 1);
  }
}

Let’s generate data by calling this Stan script from R.

# Specify data and hyperparameter values.
sim_values <- list(
  N = 100,                            # Number of individuals.
  K = 3,                              # Number of groups.
  I = 2,                              # Number of observation-level covariates.
  g = sample(3, 100, replace = TRUE), # Vector of group assignments.
  mu = 5,                             # Mean of the population model.
  tau = 1                             # Standard deviation of the population model.
)

# Generate data.
sim_data <- stan(
  file = here::here("Code", "generate_data_02.stan"),
  data = sim_values,
  iter = 1,
  chains = 1,
  seed = 42,
  algorithm = "Fixed_param"
)

# Extract simulated data and group intercepts.
sim_y <- extract(sim_data)$y
sim_X <- extract(sim_data)$X
sim_Beta <- extract(sim_data)$Beta
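
Since extract() returns arrays with a leading iterations dimension (here a single iteration), it can help to confirm the dimensions before reshaping:

# Confirm dimensions before reshaping: 1 x N, 1 x N x I, and 1 x K x I.
dim(sim_y)
dim(sim_X)
dim(sim_Beta)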

Let’s fit the model and recover parameters.

# Specify data.
data <- list(
  N = sim_values$N,                  # Number of individuals.
  K = sim_values$K,                  # Number of groups.
  I = sim_values$I,                  # Number of observation-level covariates.

  # Matrix of observation-level covariates.
  X = matrix(
    as.vector(sim_X),
    nrow = sim_values$N,
    ncol = sim_values$I
  ),

  y = as.vector(sim_y),              # Vector of observations.
  g = sim_values$g                   # Vector of group assignments.
)

# Calibrate the model.
fit <- stan(
  file = here::here("Code", "hierarchical_regression_02.stan"),
  data = data,
  control = list(adapt_delta = 0.99),
  seed = 42
)

Note that we’ve preemptively set adapt_delta = 0.99 given the previous model’s performance. Let’s confirm things have worked by running diagnostics.

# Diagnostics.
source(here::here("Code", "stan_utility.R"))
check_all_diagnostics(fit)
[1] "n_eff / iter looks reasonable for all parameters"
[1] "Rhat looks reasonable for all parameters"
[1] "0 of 4000 iterations ended with a divergence (0%)"
[1] "0 of 4000 iterations saturated the maximum tree depth of 10 (0%)"
[1] "E-FMI indicated no pathological behavior"

Everything looks good! Let’s look at our expanded trace plots.

# Check trace plots.
par_string <- str_c("Beta[", 1:data$K, ",", 1, "]")
for (i in 2:data$I) {
  temp <- str_c("Beta[", 1:data$K, ",", i, "]")
  par_string <- c(par_string, temp)
}
fit %>%
  mcmc_trace(
    pars = c("mu", "tau", par_string),
    n_warmup = 500,
    facet_args = list(
      nrow = (data$I * data$K + 2) / 2,
      ncol = 2,
      labeller = label_parsed
    )
  )

Finally, we can demonstrate parameter recovery.

# Recover hyperparameter and parameter values.
hyperpar_values <- tibble(
  .variable = c("mu", "tau"),
  values = c(sim_values$mu, sim_values$tau)
)

par_values <- tibble(
  n = sort(rep(1:(data$K), data$I)),
  i = rep(1:(data$I), data$K),
  .variable = str_c("Beta", "_", n, "_", i),
  values = as.vector(t(matrix(sim_Beta, ncol = data$I)))
) %>%
  select(.variable, values)

fit %>%
  gather_draws(mu, tau) %>%
  ggplot(aes(x = .value, y = .variable)) +
  geom_halfeyeh(.width = .95) +
  facet_wrap(
    ~ .variable,
    nrow = 2,
    scales = "free"
  ) +
  geom_vline(aes(xintercept = values), hyperpar_values, color = "red")

fit %>%
  gather_draws(Beta[n, i]) %>%
  unite(.variable, .variable, n, i) %>%
  ggplot(aes(x = .value, y = .variable)) +
  geom_halfeyeh(.width = .95) +
  geom_vline(aes(xintercept = values), par_values, color = "red") +
  facet_wrap(
    ~ .variable,
    nrow = data$K,
    ncol = data$I,
    scales = "free"
  )

The model looks good! We have one more step to consider.

Hierarchical regression

In this final section, let’s fully embrace what it means to have a hierarchical model. With both an observational model and a population model, we can include both observation-level and population-level covariates. Furthermore, the previous model fixed the likelihood scale at a known value of 1. Relaxing that assumption means another parameter to estimate. While there isn’t a concern about having “too many parameters,” since this will simply be expressed in terms of less certainty about any given parameter, we will run into issues with dimensionality.

Exciting, right? Let’s get started.

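Here is a sketch of hierarchical_regression_03.stan; the hyperprior on Gamma and the Cauchy priors on tau and sigma are assumptions carried over from the earlier models.

// Index values, observations, and covariates.
data {
  int<lower = 1> N;               // Number of individuals.
  int<lower = 1> K;               // Number of groups.
  int<lower = 1> I;               // Number of observation-level covariates.
  int<lower = 1> J;               // Number of population-level covariates.

  vector[N] y;                    // Vector of observations.
  int<lower = 1, upper = K> g[N]; // Vector of group assignments.
  matrix[N, I] X;                 // Matrix of observation-level covariates.
  matrix[K, J] Z;                 // Matrix of population-level covariates.
}

// Parameters and hyperparameters.
parameters {
  matrix[K, I] Beta;              // Matrix of group-level coefficients.
  matrix[J, I] Gamma;             // Matrix of population-level coefficients.
  real<lower = 0> tau;            // Standard deviation of the population model.
  real<lower = 0> sigma;          // Standard deviation of the likelihood.
}

// Hierarchical regression with population-level covariates.
model {
  // Hyperpriors.
  to_vector(Gamma) ~ normal(0, 5);
  tau ~ cauchy(0, 2.5);

  // Prior.
  sigma ~ cauchy(0, 2.5);

  // Population model and likelihood.
  for (k in 1:K) {
    Beta[k,] ~ normal(Z[k,] * Gamma, tau);
  }
  for (n in 1:N) {
    y[n] ~ normal(X[n,] * Beta[g[n],]', sigma);
  }
}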

In the data block, we now have an index J specifying the number of population-level covariates and a K x J matrix Z that contains those covariates (again, including an intercept). In the parameters block, we now have Gamma, a J x I matrix of population-level coefficients. The model block also changes, with the population model including Z[k,] * Gamma in place of mu.

We can again translate this Stan script, hierarchical_regression_03.stan, into data and generated quantities blocks called generate_data_03.stan to use Stan to generate data, including X and Z, for us.

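Here is a sketch of generate_data_03.stan following the same pattern as generate_data_02.stan; generating the non-intercept columns of Z with uniform_rng is an assumption.

// Index and hyperparameter values.
data {
  int<lower = 1> N;               // Number of individuals.
  int<lower = 1> K;               // Number of groups.
  int<lower = 1> I;               // Number of observation-level covariates.
  int<lower = 1> J;               // Number of population-level covariates.

  int<lower = 1, upper = K> g[N]; // Vector of group assignments.
  matrix[J, I] Gamma;             // Matrix of population-level coefficients.
  real<lower = 0> tau;            // Standard deviation of the population model.
  real<lower = 0> sigma;          // Standard deviation of the likelihood.
}

// Generate data according to the hierarchical regression.
generated quantities {
  vector[N] y;                    // Vector of observations.
  matrix[N, I] X;                 // Matrix of observation-level covariates.
  matrix[K, J] Z;                 // Matrix of population-level covariates.
  matrix[K, I] Beta;              // Matrix of group-level coefficients.

  // Generate Z (including the intercept) and draw group-level coefficients.
  for (k in 1:K) {
    for (j in 1:J) {
      if (j == 1) Z[k, j] = 1;
      if (j != 1) Z[k, j] = uniform_rng(1, 7);
    }
    for (i in 1:I) {
      Beta[k, i] = normal_rng(Z[k,] * Gamma[, i], tau);
    }
  }

  // Generate X (including the intercept) and y.
  for (n in 1:N) {
    for (i in 1:I) {
      if (i == 1) X[n, i] = 1;
      if (i != 1) X[n, i] = uniform_rng(1, 7);
    }
    y[n] = normal_rng(X[n,] * Beta[g[n],]', sigma);
  }
}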

Let’s generate data by calling this Stan script from R.

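A sketch of the corresponding R code, mirroring the earlier data-generation calls; the values in Gamma are illustrative assumptions.

# Specify data and hyperparameter values.
sim_values <- list(
  N = 100,                            # Number of individuals.
  K = 3,                              # Number of groups.
  I = 2,                              # Number of observation-level covariates.
  J = 2,                              # Number of population-level covariates.
  g = sample(3, 100, replace = TRUE), # Vector of group assignments.
  Gamma = matrix(c(5, 1, 2, -1), nrow = 2, ncol = 2), # Population-level coefficients (assumed values).
  tau = 1,                            # Standard deviation of the population model.
  sigma = 1                           # Standard deviation of the likelihood.
)

# Generate data.
sim_data <- stan(
  file = here::here("Code", "generate_data_03.stan"),
  data = sim_values,
  iter = 1,
  chains = 1,
  seed = 42,
  algorithm = "Fixed_param"
)

# Extract simulated data, covariates, and group-level coefficients.
sim_y <- extract(sim_data)$y
sim_X <- extract(sim_data)$X
sim_Z <- extract(sim_data)$Z
sim_Beta <- extract(sim_data)$Beta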

Let’s fit the model and recover parameters.

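And a sketch of the fitting code, consistent with the notes on adapt_delta and iter below.

# Specify data.
data <- list(
  N = sim_values$N,     # Number of individuals.
  K = sim_values$K,     # Number of groups.
  I = sim_values$I,     # Number of observation-level covariates.
  J = sim_values$J,     # Number of population-level covariates.
  y = as.vector(sim_y), # Vector of observations.
  g = sim_values$g,     # Vector of group assignments.
  X = matrix(as.vector(sim_X), nrow = sim_values$N, ncol = sim_values$I),
  Z = matrix(as.vector(sim_Z), nrow = sim_values$K, ncol = sim_values$J)
)

# Calibrate the model.
fit <- stan(
  file = here::here("Code", "hierarchical_regression_03.stan"),
  data = data,
  control = list(adapt_delta = 0.99),
  iter = 4000,
  seed = 42
)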

Note that we again have adapt_delta = 0.99. I’ve also increased the number of iterations to iter = 4000, double the default, half of which will be used for warmup.

Still seeing divergent transitions? The next step is to reparameterize our hierarchical model so that the posterior geometry is easier to navigate. The non-centered parameterization is commonly used: rather than drawing the group-level coefficients directly from the population model, we draw standardized values and impose a deterministic transformation between the layers in the hierarchy.

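Here is a sketch of the non-centered parameterization in Stan. The data block is unchanged from hierarchical_regression_03.stan, and the standardized matrix z is the auxiliary parameter introduced by the reparameterization (the name z is an assumption).

// Parameters and hyperparameters.
parameters {
  matrix[K, I] z;                 // Matrix of standardized group-level coefficients.
  matrix[J, I] Gamma;             // Matrix of population-level coefficients.
  real<lower = 0> tau;            // Standard deviation of the population model.
  real<lower = 0> sigma;          // Standard deviation of the likelihood.
}

// Deterministic transformation from z to Beta.
transformed parameters {
  matrix[K, I] Beta = Z * Gamma + tau * z;
}

// Hierarchical regression with a non-centered population model.
model {
  // Hyperpriors.
  to_vector(Gamma) ~ normal(0, 5);
  tau ~ cauchy(0, 2.5);

  // Prior.
  sigma ~ cauchy(0, 2.5);

  // Non-centered population model and likelihood.
  to_vector(z) ~ normal(0, 1);
  for (n in 1:N) {
    y[n] ~ normal(X[n,] * Beta[g[n],]', sigma);
  }
}

Because z is a standard normal and Beta is a deterministic function of z, Gamma, and tau, this implies the same population model as before, Beta[k,] ~ normal(Z[k,] * Gamma, tau), while giving the sampler a much friendlier geometry to navigate.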

Final thoughts

There are a few more things worth emphasizing. Hierarchical models should be our default approach in most applications, and Stan removes the barriers to entry for the required Bayesian computation. Beyond directly modeling heterogeneity, hierarchical models partially pool information across groups, regularizing the group-level estimates and improving inference for groups with few observations.


Marc Dotson

Marc is an assistant professor of marketing at the BYU Marriott School of Business. He graduated with an MSc from The London School of Economics and Political Science in 2009 and a PhD from The Ohio State University in 2016. His research interests include Bayesian inference, predictive modeling, consumer preference heterogeneity, and unstructured data. Marc teaches marketing analytics to undergraduate and MBA students. You can find him on Twitter and GitHub.